Abstract

With the growing focus on semantic searches and inter- 
pretations, an increasing number of standardized vocabu- 
laries and ontologies are being designed and used to de- 
scribe data. We investigate the querying of objects de- 
scribed by a tree-structured ontology. Specifically, we 
consider the case of finding the top-k best pairs of objects
that have been annotated with terms from such an
ontology when the object descriptions are available only 
at runtime. We consider three distance measures. The 
first one defines the object distance as the minimum pair- 
wise distance between the sets of terms describing them, 
and the second one defines the distance as the average 
pairwise term distance. The third and most useful dis- 
tance measure — earth mover's distance — finds the best 
way of matching the terms and computes the distance 
corresponding to this best matching. We develop lower 
bounds that can be aggregated progressively and utilize 
them to speed up the search for top-k object pairs when
the earth mover's distance is used. For the minimum 
pairwise distance, we devise an algorithm that runs in 
$O(D + Tk \log k)$ time, where D is the total information
size and T is the total number of terms in the ontology. We 
also develop a novel best-first search strategy for the aver- 
age pairwise distance that utilizes lower bounds generated 
in an ordered manner. Experiments on real and synthetic 
datasets demonstrate the practicality and scalability of our 
algorithms. 

1 Introduction 

We are witnessing an unprecedented growth in annotated 
information. This growth has been motivated by a need to 
share information and, more recently, by a need to search 
and analyze objects based on their structure and semantics.
Annotated objects occur in multiple application domains
including language (http://wordnet.princeton.edu/),
biology (http://www.geneontology.org), medical documents
(http://www.nlm.nih.gov/mesh/), web content
(http://www.semanticweb.org/), etc. In all these cases, an- 
notations are derived from a structured vocabulary or on- 
tology. An ontology uses a number of different relation- 
ships (e.g., is-a, is-part-of) to organize concepts or hierar- 
chies. 

This paper investigates the analysis of large sets of ob- 
jects that have been annotated with terms from a common 
ontology. The basic problem we consider is as follows: 
Given two sets of objects annotated with terms from a
common ontology, how do we find the top-k pairs of objects
among the two sets that are most similar?

The above problem statement requires us to formalize
the notion of distance between two terms in a given ontol- 
ogy and then to extend this notion to distance between two 
annotated objects. The distance between two terms can be 
measured by the shortest path distance on the ontology. 

There are a number of definitions for distance (or con- 
versely, similarity) between objects. Two obvious defini- 
tions are based on the minimum pairwise distance and the 
average pairwise distance between the annotations. The 
third one is the earth mover's distance [12], which takes into
account the relative positions of the terms that describe
the objects. We investigate querying based on these three 
distance measures. 

In this paper, we consider that the object descriptions 
are submitted in an online fashion, i.e., they are available 
only at run-time. As such, no pre-processing or index con- 
struction or any other offline processing can be used, and 
all the computation costs are paid at run-time. Even if 
the distance function used is a metric, the online nature 
of the problem renders the use of index structures like the 
M-tree Q infeasible due to their high index construction 
Figure 1: Example of an ontology tree with objects.

times. In a way, this problem is reminiscent of the com- 
putation of spatial joins on objects embedded in the Eu- 
clidean space: the spatial datasets are delivered online and 
we need to compute the best spatial matches [16]. The
difference is that, in the case of ontologies, the primitive
distance is not Euclidean, but computed on a tree.

The problem we consider here can be extended easily 
to the case when objects are annotated with multiple in- 
dependent ontologies. We can compute the per-ontology 
distance and combine them using an aggregate ranking 
technique such as the threshold algorithm [7]. The problem
of finding objects similar to a given query object (i.e.,
the k-NN problem) reduces to the special case of a join of
the database with a singleton set, the query object. Similarly,
range queries can be solved by choosing only those
pairs having a distance less than the query range. While 
these and other kinds of queries can also be considered in 
our setting, the problem of top-k joins exposes the computational
and data management complexities of this domain
well, making it the right problem to consider.

Formally, our problem can be stated as: 

Problem 1. Given a set of objects each of which is defined
by a set of terms from an ontology and a distance function
$d(O_i, O_j)$ between two objects $O_i$ and $O_j$, find k pairs of
objects P such that for any $(O_i, O_j) \in P$ and $(O_g, O_h) \notin P$,
$d(O_i, O_j) \le d(O_g, O_h)$.
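As a baseline reading of Problem 1, the sketch below simply scores every pair and keeps the k smallest. The names `top_k_pairs` and `non_shared_terms` are illustrative assumptions, and the toy distance is only a stand-in for the three measures studied in the paper; the object descriptions are those of the running example.

```python
import heapq
from itertools import combinations

def top_k_pairs(objects, distance, k):
    """Brute-force baseline: score every object pair and keep the k smallest."""
    scored = ((distance(objects[a], objects[b]), a, b)
              for a, b in combinations(sorted(objects), 2))
    return heapq.nsmallest(k, scored)

def non_shared_terms(terms_a, terms_b):
    """Toy stand-in distance (an assumption): number of terms not shared."""
    return len(terms_a ^ terms_b)

# Object descriptions of the running example.
OBJECTS = {
    "O1": {"t1", "t7"},
    "O2": {"t0", "t1", "t4", "t5"},
    "O3": {"t2", "t3", "t9"},
    "O4": {"t6", "t8"},
}

best = top_k_pairs(OBJECTS, non_shared_terms, k=2)
```

This baseline evaluates all pairs, which is exactly the cost the algorithms in the later sections try to avoid.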

Figure 1 illustrates a particular instance of the problem.
The ontology tree consists of 10 terms. There are 4 objects
that are described by these terms. The object descriptions
are given by $O_1 = \{t_1, t_7\}$, $O_2 = \{t_0, t_1, t_4, t_5\}$,
$O_3 = \{t_2, t_3, t_9\}$, $O_4 = \{t_6, t_8\}$. An inverted index, i.e.,
a mapping from a term to the set of objects it describes, can be
maintained on the ontology itself (as shown in the figure). Thus, each node
in the tree statically maintains a list L of the objects that
are described using the term corresponding to the node.
For example, the list of objects for $t_0$ is $(O_2)$. We will
use term and node interchangeably to denote the node in
which the term resides.



The edge weights on the tree decrease exponentially 
as the level increases. Concepts closer to the root of 
the ontology are less similar than concepts that share 
some common ancestors. For example, broader concepts 
such as "sports" and "politics" should be more dissimi- 
lar than relatively narrower concepts such as "football" 
and "cricket". The exponentially decreasing edge weights 
capture this notion. We highlight the fact that the expo- 
nential edge weighting function is an example, and not a 
necessity for the algorithms to work. They produce cor- 
rect answers for all edge weights. 

We denote the number of objects by N, the number 
of terms by T, the total information size (i.e., the total 
number of describing terms for all the objects) by D, and 
the number of object pairs queried by k. In Figure 1, N =
4, T = 10, and D = 11.

Our contributions in this paper are as follows: 

1. First, we propose the problem of finding the top-k most
similar object pairs that are annotated with terms in
a hierarchy in an online fashion. The terms may define
concepts in an ontology and objects may be described
using the concepts.

2. Then, we define and motivate three different distance 
functions (equivalently, similarity measures) that can 
be used to describe the similarity between a pair of 
objects. The minimum pairwise distance is useful 
for searching objects sharing a similar term (con- 
cept). The average pairwise distance can be used 
to query objects that are described using multiple at- 
tributes. The earth mover's distance (EMD) finds the 
best way of matching the terms from two objects and 
finds the distance corresponding to this best match- 
ing. 

3. Finally, we develop efficient algorithms to solve the
problem using the above distances. We use lower
bounds based on $L_1$ on a reduced number of terms to
speed up the computation of EMD. The $L_1$ distance,
in turn, is computed progressively using a modified
version of the threshold algorithm. For the minimum
pairwise distance, we show that the top-k query runs
in $O(Tk \log k)$ time, where T is the size of the ontology.
For the average pairwise distance, we devise an
efficient best-first search algorithm that avoids distance
computations by generating lower bounds in an
ordered manner. Experimental evaluations demonstrate
the scalability and practicality of our algorithms.

The rest of the paper is organized as follows. Section 2
describes the related work. Section 3 defines the term
distance and the different object distances. Sections 4, 5
and 6 present the different algorithms for finding the top-k
pairs of objects using those distances. Experimental results
are discussed in Section 7. Section 8 concludes the
paper.

2 Related Work 

Heterogeneous and high-throughput data is becoming 
commonplace in the sciences and there is consensus that 
integration of this information is needed for new break- 
throughs. In all these cases, annotations are derived from 
a structured vocabulary or ontology. The Semantic Web 
(http://www.semanticweb.org/) has defined a specific
language, OWL (http://www.w3.org/2004/OWL/), for
describing ontologies. In biology, genes are described using
Gene Ontology (GO) (http://www.geneontology.org/),
which annotates genes and gene products by three kinds
of terms reflecting molecular functions, biological
processes, and cellular components. Millions of abstracts
in Pubmed (http://www.pubmed.gov/) are indexed
using MESH terms (http://www.nlm.nih.gov/mesh/).
WordNet (http://wordnet.princeton.edu/) is a lexical
database that groups English words into cognitive
synonyms (or synsets). Hundreds of other ontologies
have been proposed over diverse application domains
such as plant structures (http://www.plantontology.org/),
description and publication of digital documents
(http://www.dublincore.org/), and earth and the environment
(http://sweet.jpl.nasa.gov/ontology/). A good
compendium of different ontologies is maintained at
http://www.ontologyonline.org/.

A given ontology uses a number of different relation- 
ships to organize concepts or hierarchical relationships. 
Of these, "is-a" and "is-part-of" relationships are the most 
prevalent. The former describes a subsumption relation- 
ship while the latter represents how objects combine to- 
gether to form composite objects. Both of these lead to 
hierarchical structures in which the proximity between 
terms (concepts) grows as we descend down the hierar- 
chy. 

There have been numerous works on gene ontology
ranging from gene function prediction using information
theory [18] to defining similarity among genes using
the full graph structure of GO [5]. In [8], a comparison
of three different gene similarity measures was presented.
Probabilistic approaches have also been used [13].
Biologists have used average and minimum pairwise
distances between genes based on GO for comparing
co-evolutionary rates of yeast genes [14] and for co-clustering
with gene expression data [2], respectively.



There are a number of similar efforts in the area 
of information retrieval where the similarity between 
documents is measured by considering the overlap of 
terms. The term-frequency inverse-document-frequency 
(tf-idf) measures consider the frequency of terms in docu- 
ments lU? |. Work on text matching showed that hierarchy- 
based measures using tf-idf outperform lexical similarity
measures [15]. Latent Semantic Indexing (LSI) [6] transforms
documents into a Euclidean space indexed by latent
semantic dimensions. EMD has been shown to be
better than other measures in finding document similarities
using the WordNet ontology [19].

Embedding an ontology into an Euclidean space Q and 
processing queries in the embedding space is another al- 
ternative. However, an object description will then span 
multiple points leading to possibly large MBRs. Further, 
the approach may suffer from high distortion of the em- 
bedding. 

In this paper, we tackle the computational challenge of 
answering queries efficiently using distances defined on 
hierarchical structures like ontologies. 



3 Distance Definitions 

3.1 Distance between Terms 

The distance $d(t_i, t_j)$ between two terms $t_i$ and $t_j$ is defined
as the length of the path between them on the ontology
tree. Since there is only one path between two terms
in a tree, from the properties of the shortest path, this dis- 
tance is a metric (0]. 

An interesting and important point to note is that when 
the term distances decrease exponentially at each level, 
the distance between two terms at the leaves of two sub- 
trees can be approximated by the distance between the 
roots of the subtrees. For example, in Figure 1, where
the edge distances are halved at each level, the distance
between a leaf under $t_1$ and a leaf under $t_2$ (= 3) can be
approximated by that between $t_1$ and $t_2$ (= 2). Using this term distance, we next
define different distance measures between the objects. 
Once more, we emphasize the fact that our algorithms are 
general enough to work correctly with all edge weights, 
and not just the exponential function. 
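The term distance can be sketched in a few lines. The tree shape below is an assumption: the figure itself is not reproduced in the text, so the parent links are inferred from the reduced descriptions in Table 4 and from the worked values (2 between $t_1$ and $t_2$, 3 between leaves of their subtrees), with edge weights halved at each level. The function names are illustrative.

```python
# Assumed shape of the Figure 1 tree: child -> (parent, edge weight); 0 is the root t0.
PARENT = {1: (0, 1.0), 2: (0, 1.0), 3: (0, 1.0),
          4: (1, 0.5), 5: (1, 0.5),
          6: (2, 0.5), 7: (2, 0.5), 8: (2, 0.5),
          9: (3, 0.5)}

def dists_to_ancestors(term):
    """Map every ancestor of `term` (and the term itself) to its distance from `term`."""
    out, dist = {term: 0.0}, 0.0
    while term in PARENT:
        term, weight = PARENT[term]
        dist += weight
        out[term] = dist
    return out

def term_distance(a, b):
    """Length of the unique tree path between terms a and b."""
    up_a, up_b = dists_to_ancestors(a), dists_to_ancestors(b)
    # The path passes through the lowest common ancestor, which is the
    # common ancestor minimizing the summed distance (weights are positive).
    return min(up_a[x] + up_b[x] for x in up_a if x in up_b)

d_roots = term_distance(1, 2)    # between the subtree roots t1 and t2
d_leaves = term_distance(4, 7)   # between a leaf under t1 and a leaf under t2
```

Nothing in the code depends on the weights halving per level; any positive edge weights work.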

We next define the three distance measures between two
objects: $d_{min}$, $d_{avg}$ and $d_{emd}$.¹

¹ We use the terms $d_{min}$ and MinDist, $d_{avg}$ and AvgDist, $d_{emd}$ and
EMD interchangeably in the paper.
Table 1: Minimum pairwise distances for the example in Figure 1.

Table 2: Average pairwise distances for the example in Figure 1.

Table 3: EMDs for the example in Figure 1.

3.2 Minimum Pairwise Distance 

Definition 1 (Minimum Pairwise Distance). The minimum
pairwise distance between two objects $O_i$ and $O_j$,
denoted by MinDist, is defined as:

$$d_{min}(O_i, O_j) = \min_{t_i \in O_i,\; t_j \in O_j} \{ d(t_i, t_j) \} \qquad (1)$$



This distance is useful when searching objects that have 
similar terms. For example, even though a single biolog- 
ical document may contain references to different terms 
like photoreceptor cells and ganglion cells, it is useful 
to be able to retrieve it when another document that de- 
scribes photoreceptor cells is queried. 

This distance is of particular use in keyword search- 
ing, where the query document consists of only the sin- 
gle keyword, and all documents having that keyword will 
be returned with a distance of 0. MinDist, in general, ex- 
tends this idea by finding additional documents that con- 
tain terms most similar to the queried keyword. 

The MinDist measure is heavily used in hierarchical 
bottom-up clustering methods where in each step, two 
clusters with the minimum pairwise distance are merged. 
It has also been successfully used for finding the distance 
between two genes, where a gene is annotated with a set 
of terms from GO |2|. 

Table 1 shows the MinDist measures among the objects
in Figure 1. MinDist is not a metric distance as it
does not satisfy the triangle inequality. For example,

$$d_{min}(O_1, O_4) + d_{min}(O_1, O_2) < d_{min}(O_2, O_4), \quad \text{since } 1.0 + 0.0 < 1.5.$$
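The triangle inequality violation can be checked numerically. This is a minimal sketch: the tree shape is an assumption inferred from Table 4 and the worked values in the text (the figure is not reproduced here), and the helper names are illustrative.

```python
# Assumed shape of the Figure 1 tree: child -> (parent, edge weight); 0 is the root t0.
PARENT = {1: (0, 1.0), 2: (0, 1.0), 3: (0, 1.0),
          4: (1, 0.5), 5: (1, 0.5),
          6: (2, 0.5), 7: (2, 0.5), 8: (2, 0.5),
          9: (3, 0.5)}
OBJECTS = {"O1": {1, 7}, "O2": {0, 1, 4, 5}, "O3": {2, 3, 9}, "O4": {6, 8}}

def dists_to_ancestors(term):
    out, dist = {term: 0.0}, 0.0
    while term in PARENT:
        term, weight = PARENT[term]
        dist += weight
        out[term] = dist
    return out

def term_distance(a, b):
    up_a, up_b = dists_to_ancestors(a), dists_to_ancestors(b)
    return min(up_a[x] + up_b[x] for x in up_a if x in up_b)

def min_dist(terms_a, terms_b):
    """Equation (1): smallest term distance over all cross pairs."""
    return min(term_distance(a, b) for a in terms_a for b in terms_b)

d14 = min_dist(OBJECTS["O1"], OBJECTS["O4"])
d12 = min_dist(OBJECTS["O1"], OBJECTS["O2"])
d24 = min_dist(OBJECTS["O2"], OBJECTS["O4"])
violates_triangle = d14 + d12 < d24
```

$O_1$ and $O_2$ share $t_1$, which is why their distance is 0 even though the remaining terms are far apart.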



3.3 Average Pairwise Distance 

Definition 2 (Average Pairwise Distance). The average
pairwise distance between two objects $O_i$ and $O_j$, denoted
by AvgDist, is defined as:

$$d_{avg}(O_i, O_j) = \frac{1}{|O_i| \cdot |O_j|} \sum_{t_i \in O_i} \sum_{t_j \in O_j} d(t_i, t_j) \qquad (2)$$

where $|O_i|$ and $|O_j|$ denote the number of terms describing
$O_i$ and $O_j$ respectively.

The AvgDist is useful in cases where the objects are
not precisely defined. For example, it has been successfully
used for gene function prediction using GO terms
for yeast genes [14] as well as in the domain of web services
[11]. The MinDist measure fails in such cases.

Table 2 shows the AvgDist measures among the objects
in Figure 1. AvgDist is not a metric, as it fails to satisfy
the identity property, i.e., $d_{avg}(x, x)$ can be greater than 0
(e.g., $d_{avg}(O_1, O_1) = 1.25$). However, since it satisfies
symmetry and the triangle inequality², it can be considered
a pseudo-metric distance.
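The failure of the identity property is easy to reproduce. The sketch below evaluates Equation (2) on the running example; the tree shape is an assumption inferred from Table 4 and the worked values, and the helper names are illustrative.

```python
# Assumed shape of the Figure 1 tree: child -> (parent, edge weight); 0 is the root t0.
PARENT = {1: (0, 1.0), 2: (0, 1.0), 3: (0, 1.0),
          4: (1, 0.5), 5: (1, 0.5),
          6: (2, 0.5), 7: (2, 0.5), 8: (2, 0.5),
          9: (3, 0.5)}
OBJECTS = {"O1": {1, 7}, "O2": {0, 1, 4, 5}, "O3": {2, 3, 9}, "O4": {6, 8}}

def dists_to_ancestors(term):
    out, dist = {term: 0.0}, 0.0
    while term in PARENT:
        term, weight = PARENT[term]
        dist += weight
        out[term] = dist
    return out

def term_distance(a, b):
    up_a, up_b = dists_to_ancestors(a), dists_to_ancestors(b)
    return min(up_a[x] + up_b[x] for x in up_a if x in up_b)

def avg_dist(terms_a, terms_b):
    """Equation (2): mean term distance over all |A| * |B| cross pairs."""
    total = sum(term_distance(a, b) for a in terms_a for b in terms_b)
    return total / (len(terms_a) * len(terms_b))

self_distance = avg_dist(OBJECTS["O1"], OBJECTS["O1"])   # non-zero: t1 and t7 are 2.5 apart
d12 = avg_dist(OBJECTS["O1"], OBJECTS["O2"])
```

The self-distance is positive because the two cross pairs $(t_1, t_7)$ and $(t_7, t_1)$ enter the average alongside the two zero-distance pairs.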



3.4 Earth Mover's Distance 

Apart from the property of not being a true metric, 
AvgDist also suffers from the fact that each term in one 
object is matched with every other term in the other ob- 
ject. For example, consider two documents with the 
terms {war, sports} and {war, football}. Even though 
it is obvious that the distance between these two docu- 
ments should be small, the average distance unnecessarily 
compares "war" in the first document with "football" in 
the other. The earth mover's distance (EMD) [12] rectifies
this shortcoming by comparing only the like terms
through finding the best matching between the terms of 
the two documents. For this example, EMD will match 
"war" with "war" and "sports" with "football" and aggre- 
gate these distances only. EMD has been shown to be 
better than other distances in finding similar documents 
using the WordNet ontology [19]. 

Formally, each object is considered to be composed of 
"mass" at the specific spatial locations (corresponding to 
the terms that describe the object) in the ontology. The 
total mass of each object is 1; consequently, the mass at
each term location is the inverse of the number of terms
describing the object. For example, $O_1$ in Figure 1 will have
mass 1/2 at each of the terms $t_1$ and $t_7$.

² See Appendix for the proof.

Figure 2: Reduction of terms.

The EMD between two objects A and B is the mini- 
mum work required to transform A to B, where one unit 
of work is equal to moving one unit of mass through one 
unit of distance in the ontology. Finding the best "flows" 
(i.e., how much mass needs to be moved from one term 
in A to another term in B) is a linear programming (LP) 
problem. 

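Definition 3 (Earth Mover's Distance). The earth mover's
distance between two objects $O_i$ and $O_j$, denoted by
EMD, is defined as:

$$d_{emd}(O_i, O_j) = \min_{\{f_{pq}\}} \sum_{t_p \in O_i} \sum_{t_q \in O_j} c_{pq} f_{pq} \qquad (3)$$

$$\text{s.t. each } f_{pq} \ge 0, \quad \forall t_p \in O_i: \sum_{t_q \in O_j} f_{pq} = O_{i_p}, \quad \forall t_q \in O_j: \sum_{t_p \in O_i} f_{pq} = O_{j_q}$$

where $c_{pq}$ is the ground distance between the terms $t_p$ and
$t_q$ as per the ontology tree and $O_{i_p}$ is the mass of $t_p$ in $O_i$.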

EMD is a metric when the ground distance is a metric
(proof in [12]). Table 3 shows the EMDs among the
objects in Figure 1.
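For small examples the EMD can be evaluated without an LP solver. The sketch below does not solve the linear program of Definition 3; it swaps in the known closed form for tree metrics, where the optimal cost equals the sum over edges of the edge weight times the net mass that must cross that edge. The tree shape is an assumption inferred from Table 4 and the worked values in the text, and the function names are illustrative.

```python
from fractions import Fraction

HALF = Fraction(1, 2)
# Assumed shape of the Figure 1 tree: child -> (parent, edge weight); 0 is the root t0.
PARENT = {1: (0, 1), 2: (0, 1), 3: (0, 1),
          4: (1, HALF), 5: (1, HALF),
          6: (2, HALF), 7: (2, HALF), 8: (2, HALF),
          9: (3, HALF)}
OBJECTS = {"O1": {1, 7}, "O2": {0, 1, 4, 5}, "O3": {2, 3, 9}, "O4": {6, 8}}

def depth(term):
    d = 0
    while term in PARENT:
        term = PARENT[term][0]
        d += 1
    return d

def emd_on_tree(terms_a, terms_b):
    """EMD between two term sets with uniform masses, via the tree closed form."""
    net = {}
    for t in terms_a:                       # object A supplies mass 1/|A| per term
        net[t] = net.get(t, 0) + Fraction(1, len(terms_a))
    for t in terms_b:                       # object B demands mass 1/|B| per term
        net[t] = net.get(t, 0) - Fraction(1, len(terms_b))
    cost = Fraction(0)
    # Deepest nodes first: the surplus of a subtree must cross the edge to its parent.
    for node in sorted(PARENT, key=depth, reverse=True):
        parent, weight = PARENT[node]
        flow = net.get(node, 0)
        cost += weight * abs(flow)
        net[parent] = net.get(parent, 0) + flow
    return cost

emd_12 = emd_on_tree(OBJECTS["O1"], OBJECTS["O2"])
```

Exact fractions are used so that the uniform masses (1/2, 1/3, 1/4) do not pick up rounding error.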

3.5 Comparison of the Distance Measures 

To compare the usefulness of the three distance mea- 
sures, we performed the following experiment. We used 
WordNet (http://wordnet.princeton.edu/) as the ontology
and the "bag-of-words" dataset from the UCI repository
(http://archive.ics.uci.edu/ml/datasets/Bag+of+Words) as
the set of objects. We chose the first 59 documents 
from the categories enron and kos of the bag-of-words 
dataset. Each document was described using nouns from 
the WordNet ontology, and the ontology was converted 
into a tree. The top-50 pairs were obtained using all the 
three distances. For EMD, on an average, there were 45 
pairs where both the objects were from the same category. 
Also, all 10 out of the top-10 pairs were of this nature. 



Object   Before                                After
         {t0,t1,t2,t3,t4,t5,t6,t7,t8,t9}       {t0,t1,t2,t3}
O1       {0,1/2,0,0,0,0,0,1/2,0,0}             {0,1/2,1/2,0}
O2       {1/4,1/4,0,0,1/4,1/4,0,0,0,0}         {1/4,3/4,0,0}
O3       {0,0,1/3,1/3,0,0,0,0,0,1/3}           {0,0,1/3,2/3}
O4       {0,0,0,0,0,0,1/2,0,1/2,0}             {0,0,1,0}

Table 4: Reduction of terms using Figure 2 for the example
in Figure 1.

The corresponding numbers for the AvgDist distance were 23
and 6 respectively. MinDist returned 505 object pairs
with distance 0, as many objects shared one or more terms.
Consequently, the top-k lists returned were arbitrary. This
convinced us of the quality of the EMD and its usefulness
in finding the top-k similar pairs of objects described by
terms on tree ontologies. Nevertheless, the two other distance
measures have proved useful in specific
contexts [2,14].

We next design algorithms to efficiently compute the
top-k pairs using these distances. We start with the EMD
as it is the most interesting and useful measure.

4 The Algorithm for EMD 

When the two sets contain N objects each, the problem of
finding the top-k pairs of objects can be solved by performing
$O(N^2)$ EMD computations. However, the prohibitive
time required by each EMD computation makes the entire
running time ($O(N^2) \times O(EMD) + O(N^2 \log N)$)
impractical.³

4.1 Lower Bound using Reduced Number 
of Terms 

When the ontology tree has a size of T, the ground distance
matrix is of size $T^2$. However, we need not consider
all the terms, as we can prune the terms that are absent
in either of the object descriptions.⁴ Thus, the number
of flow variables is quadratic in the size of the object
descriptions. This is still impractical: the average time
taken to compute $d_{emd}$ for objects of size 7 was found to
be 54 ms.⁵

³ The sorting of $N^2$ pairs requires an additional $O(N^2 \log N)$ time.

⁴ The row and column sums for these terms in the flow matrix will be
0 and hence all the flows will be individually 0 as well.

⁵ All the times reported in the paper are based on a 3 GHz machine
with 2 GB of RAM running Fedora Linux 9.

Figure 3: Example of ordered generation of object pairs
for one dimension.

Since the complexity of EMD depends mainly on the
number of flow variables, which is quadratic in the number
of terms by which each object is described, the running
time can be reduced if the size of the object descriptions
is reduced. Figure 2 shows how such a reduction can
be accomplished. The ontology tree is pruned at height
1; only the root term and its immediate children remain.
When a term thus deleted appears in an object description,
it is replaced by its ancestor that is retained. Hence, all the
terms in the dashed subtrees in the figure are removed and
replaced by the root of the subtrees. The size of an object
description is now upper bounded by the branching factor
of the root. Table 4 shows the reduced object descriptions.
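The reduction step amounts to walking each term up to its retained ancestor and accumulating mass there. A minimal sketch follows; the parent links are an assumption inferred from the reduced descriptions in Table 4 (the figure is not reproduced in the text), and `reduce_description` is an illustrative name.

```python
from fractions import Fraction

# Assumed parent links of the Figure 1 tree (child -> parent); 0 is the root t0.
PARENT = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2, 9: 3}

def depth(term):
    d = 0
    while term in PARENT:
        term = PARENT[term]
        d += 1
    return d

def reduce_description(terms, height=1):
    """Replace every term deeper than `height` by its ancestor at that height
    and sum the masses (1/|terms| each) that land on the same retained term."""
    mass = Fraction(1, len(terms))
    reduced = {}
    for term in terms:
        while depth(term) > height:
            term = PARENT[term]
        reduced[term] = reduced.get(term, 0) + mass
    return reduced

reduced_o2 = reduce_description({0, 1, 4, 5})   # t4 and t5 collapse onto t1
reduced_o3 = reduce_description({2, 3, 9})      # t9 collapses onto t3
```

Passing a larger `height` prunes the tree lower down, which keeps more terms and gives a tighter but more expensive bound.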

The EMD between two objects calculated using the reduced
ontology is a lower bound of the EMD using the full
ontology [12].⁶ The number of terms in the reduced ontology
is generally much smaller (say $t \ll T$), thus reducing
the number of flow variables to $t^2$. Since the complexity
of linear programming is at least super-linear in the number
of flow variables, the running time of EMD decreases
by a large factor of $T^2 / t^2$. The number of distance computations,
however, still remains $O(N^2)$. Next, we show
how to reduce the number of distance computations.

4.2 L1 Lower Bound

The $L_1$ distance, when scaled by the sum of the total mass,
can be used as a lower bound for EMD [1]. Hence, the
$L_1$ distance between two objects computed using all T
terms, when divided by 2, serves as a lower bound for
EMD between the objects. From now on, whenever we
mention $L_1$, we mean the scaled version of it which is a
lower bound. $L_1$ on all terms, in turn, is lower bounded
by $L_1$ on a reduced number of terms. The proof uses the
fact that $|a_i - b_i| + |a_j - b_j| \ge |(a_i + a_j) - (b_i + b_j)|$,
i.e., when the values are combined, the difference of the
sums is no more than the sum of the differences. Therefore,
$L_{1_t}(O_i, O_j) \le L_{1_T}(O_i, O_j) \le EMD_T(O_i, O_j)$, where
the subscripts denote the number of terms used. Since
$L_1$ is much faster to compute (for 3 terms, it takes only
0.002 ms), we can calculate a lower bound on EMD for
each object pair and then use it as a filtering step to prune
many of the pairs.
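The chain of bounds can be checked on the running example. The sketch below computes the scaled $L_1$ on the reduced terms, the scaled $L_1$ on all terms, and the EMD for every object pair. Several things here are assumptions: the tree shape is inferred from Table 4 and the worked values, the EMD is evaluated with the closed form for tree metrics rather than the LP, and all names are illustrative. The check covers these six pairs only; since the scaled $L_1$ measures how much mass has to move, reading it as a bound on EMD appears to rely on distinct terms being at least one unit apart, which is my reading and is not stated in the text.

```python
from fractions import Fraction
from itertools import combinations

HALF = Fraction(1, 2)
# Assumed shape of the Figure 1 tree: child -> (parent, edge weight); 0 is the root t0.
PARENT = {1: (0, 1), 2: (0, 1), 3: (0, 1),
          4: (1, HALF), 5: (1, HALF),
          6: (2, HALF), 7: (2, HALF), 8: (2, HALF),
          9: (3, HALF)}
OBJECTS = {"O1": {1, 7}, "O2": {0, 1, 4, 5}, "O3": {2, 3, 9}, "O4": {6, 8}}

def depth(term):
    d = 0
    while term in PARENT:
        term = PARENT[term][0]
        d += 1
    return d

def masses(terms, height=None):
    """Mass vector of an object; with `height` set, deeper terms are
    replaced by their retained ancestor (the reduction of Section 4.1)."""
    out = {}
    for term in terms:
        while height is not None and depth(term) > height:
            term = PARENT[term][0]
        out[term] = out.get(term, 0) + Fraction(1, len(terms))
    return out

def scaled_l1(m_a, m_b):
    """L1 distance between mass vectors, divided by the total mass of 2."""
    return sum(abs(m_a.get(t, 0) - m_b.get(t, 0)) for t in set(m_a) | set(m_b)) / 2

def emd_on_tree(m_a, m_b):
    """Closed form on a tree: each edge pays weight * |net mass crossing it|."""
    net = {t: m_a.get(t, 0) - m_b.get(t, 0) for t in set(m_a) | set(m_b)}
    cost = Fraction(0)
    for node in sorted(PARENT, key=depth, reverse=True):
        parent, weight = PARENT[node]
        flow = net.get(node, 0)
        cost += weight * abs(flow)
        net[parent] = net.get(parent, 0) + flow
    return cost

bounds = {}
for a, b in combinations(sorted(OBJECTS), 2):
    full_a, full_b = masses(OBJECTS[a]), masses(OBJECTS[b])
    red_a, red_b = masses(OBJECTS[a], 1), masses(OBJECTS[b], 1)
    bounds[a, b] = (scaled_l1(red_a, red_b),      # L1 on t reduced terms
                    scaled_l1(full_a, full_b),    # L1 on all T terms
                    emd_on_tree(full_a, full_b))  # exact EMD
```

Each entry of `bounds` is a triple that can be read left to right as cheapest-and-loosest to most-expensive-and-exact, which is the order in which a filter would evaluate them.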

⁶ A lower bound can be obtained by pruning the tree at any height.
However, there is a trade-off between the tightness and computational 
efficiency of the lower bound. 



The $L_1$ distance between two objects is a sum of the
distances between the corresponding values in each dimension;
therefore, if the distances for all the object pairs
are obtained and sorted for each dimension, the threshold
algorithm (TA) [7] can be applied to obtain the object
pairs with the least sum of distances, i.e., the least $L_1$ distance,
in a progressive manner. The order of obtaining increasing
$L_1$ distances can then be used as a guide to order
the EMD computations of the object pairs.

Obtaining a sorted list of object pairs for each dimension
requires $O(N^2 \log N)$ time. TA, however, also works
when the next object pair in the list can be output in a
sorted manner whenever needed. This avoids the $O(N^2)$
computations. Hence, our problem is now reduced to
outputting the next smallest pairwise distance whenever
asked for in a particular list (or dimension).

For this, we maintain two data structures for each di- 
mension: (i) a min-heap H that outputs the next best pair, 
and (ii) a list C that stores all the pairs that have been 
outputted from H. 

Initially, the objects are sorted and all N − 1 consecutive
object pairs (not necessarily $O_iO_{i+1}$) corresponding
to the N − 1 differences are inserted into H. Figure 3 shows
an example. The 5 objects are sorted according to their
values for the dimension that is being processed. Initially,
H contains the 4 object pairs corresponding to the 4 differences
in the sorted list. Whenever the next pair is asked
for by TA, the minimum object pair from H is extracted
and returned. It is also inserted into C. In this example,
after the first call, O3O5 is extracted from H and inserted
into C. Similarly, in the next call, O5O4 is extracted.

The initial pairs are not sufficient though. There may 
be a non-initial pair (e.g., O3O4) with a value (3) less 
than that of an initial pair (O4O2 with value 6). However, 
the important point to note is that any non-initial pair is a 
combination of some of the initial pairs. Two pairs which 
have an overlapping object can be fused together to generate
a new pair. For example, O3O4 can be generated from
O3O5 and O5O4 since O5 is overlapping. Further, a pair 
can never be the least pair until and unless the pairs from 
which it has been generated have been chosen (i.e., output 
from H). Therefore, in the example, O3O4 is added to 
H only after both O3O5 and O5O4 have been chosen. In 
general, when a pair OxOy is chosen, the contents of C 
are scanned and new pairs are generated if possible. If C 
has pairs of the form OwOx and OyOz, new pairs OwOy
and OxOz are generated respectively and are inserted into
H. The value of this new pair is the sum of the values of 
the pairs from which it is generated. 
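The lazy generation of pairs for one dimension can be sketched as a Python generator. This is an illustration under stated assumptions, not the authors' implementation: the function name is made up, a set is added to avoid queueing the same pair twice, and the values are hypothetical numbers chosen to mimic the narrative of Figure 3 (O3O5 first, then O5O4, then the fused pair O3O4 with value 3, before O4O2 with value 6).

```python
import heapq

def pairs_in_increasing_difference(values):
    """Lazily yield (difference, obj_a, obj_b) in non-decreasing order of difference."""
    order = sorted(values, key=values.get)          # objects sorted by value
    # Heap H starts with the N - 1 pairs of neighbours in the sorted order.
    heap = [(values[order[i + 1]] - values[order[i]], i, i + 1)
            for i in range(len(order) - 1)]
    heapq.heapify(heap)
    queued = {(i, j) for _, i, j in heap}           # guards against duplicate pairs
    chosen = []                                     # list C of pairs already output
    while heap:
        diff, i, j = heapq.heappop(heap)
        yield diff, order[i], order[j]
        for c_diff, ci, cj in chosen:
            if cj == i:        # (ci, i) fused with (i, j) gives (ci, j)
                fused = (c_diff + diff, ci, j)
            elif ci == j:      # (i, j) fused with (j, cj) gives (i, cj)
                fused = (diff + c_diff, i, cj)
            else:
                continue
            if fused[1:] not in queued:
                queued.add(fused[1:])
                heapq.heappush(heap, fused)
        chosen.append((diff, i, j))

# Hypothetical values for one dimension (not taken from the paper).
VALUES = {"O1": 0, "O2": 19, "O3": 10, "O4": 13, "O5": 11}
ordered = list(pairs_in_increasing_difference(VALUES))
```

Because it is a generator, a caller such as TA can pull one pair at a time and stop early without ever materializing all $N(N-1)/2$ pairs.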
Algorithm EMD
Input: Reduced object list O with t terms
Output: Object pair list P
1.  for dimension j := 0 to t − 1
2.      Sort O on dimension j only
3.      Insert N − 1 differences into heap H[j]
4.      List C[j] := ∅
5.      Threshold τ[j] := 0
6.  end for
7.  P := ∅ (∴ P.dist (i.e., k-th distance in P) := ∞)
8.  Threshold R := sum of all τ[j] (therefore, R := 0)
9.  j := 0
10. while R < P.dist
11.     Extract minimum pair p from H[j]
12.     τ[j] := difference for pair p
13.     if p is not seen earlier
14.         d1 := L1 on t terms of p
15.         if d1 < P.dist
16.             d2 := L1 on all terms of p
17.             if d2 < P.dist
18.                 d3 := EMD on all terms of p
19.                 if d3 < P.dist
20.                     Insert p into P
21.                     Update P.dist as new k-th distance
22.                 end if
23.             end if
24.         end if
25.     end if
26.     Update R using τ[j]
27.     Scan C[j] with p to generate new pairs Γ
28.     Add Γ to H[j]
29.     j := (j + 1) mod t
30. end while



4.3 Algorithm 

Figure 4 summarizes the entire EMD algorithm that uses
TA with the lower bounding strategy. First, the $L_1$ lower
bound using the reduced number of terms is computed for
the pair extracted from the heap (line 14). If it is less than the current k-th
distance estimate P.dist in P (line 15), the bound is improved by
computing the $L_1$ using all the terms (line 16). If it is still
less (line 17), the exact EMD is computed (line 18) and
the top-k list is modified, if necessary (line 21). For each
such $L_1$-reduced computation, the threshold distance (R)
is increased. When R is no less than the k-th distance in P, no other object
pair can have an $L_1$-reduced distance less than those of the top-k
pairs already found. Therefore, the EMD distances will
also be greater. Hence, the algorithm is then halted.

4.4 Analysis of Time Complexity 

For each of the t dimensions, we incur the following cost.
Initially, sorting takes $O(N \log N)$ time.⁷ Thereafter,
inserting the N − 1 elements in the heap takes $O(N)$
time. With each call to the heap, an extract operation
takes $O(\log N)$. At the i-th iteration, at most i elements
are added to the heap again. This takes $O(i \log N)$ time.
Thus, if we have k' calls (this k' is generally larger than k
as many object pairs with low lower bound but high EMD
are examined), the time per dimension is $O(k'^2 \log N)$,
which leads to a total time of $\sum_{j=1}^{t} O(k_j'^2 \log N)$ or
$O(t \, k_{max}'^2 \log N)$, where $k'_{max}$ is the maximum of all $k'_j$s.

Space Complexity: Heap j requires $O(N + k_j'^2)$ space,
where $k'_j$ is the number of calls made on column j. Hence,
the total space required is $O(t(N + k_{max}'^2))$, where $k'_{max}$
is the maximum of the $k'_j$s.

5 The Algorithm for MinDist 

Unlike the $d_{emd}$ distance, whenever two terms corresponding
to two objects are encountered, the MinDist for
the object pair can be estimated. If it is better than the current
estimate, it is retained; otherwise, it is never needed
again. We next explain the MinDist algorithm that exploits
this property.

Any object pair $(O_i, O_j)$ having a lesser distance than
$(O_g, O_h)$ must have a term pair $(t_i \in O_i, t_j \in O_j)$ which
has a lesser distance than any term pair $(t_g \in O_g, t_h \in O_h)$.
Hence, we only need to identify such term pairs
$(t_i, t_j)$ that are close and process their inverted lists, i.e.,
the list of objects.

⁷ An alternative approach using hashing that may reduce this time is
discussed in Appendix.



Figure 4: The EMD algorithm. 

A pre-processing step is required to build the inverted 
lists of objects at each node. The inverted index is needed 
to be built for the minimum and average pairwise dis- 
tances but not the earth mover's distance. For each object 
$O_i$, when a term $t_j$ appears in it, $O_i$ is inserted into the
inverted list of $t_j$. The list is accessed using hashing, and
the object is inserted at the top of the list.
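Building the inverted lists takes one pass over the object descriptions. The sketch below uses a hash map from terms to lists; `build_inverted_index` is an illustrative name, and `collections.deque` is chosen here (an implementation choice, not from the paper) because inserting at the front of a deque takes constant time.

```python
from collections import defaultdict, deque

def build_inverted_index(objects):
    """Map each term to the objects it describes, newest object first."""
    index = defaultdict(deque)
    for name, terms in objects.items():
        for term in terms:
            index[term].appendleft(name)   # hashed lookup, then insert at the top
    return index

# Object descriptions of the running example.
OBJECTS = {
    "O1": ["t1", "t7"],
    "O2": ["t0", "t1", "t4", "t5"],
    "O3": ["t2", "t3", "t9"],
    "O4": ["t6", "t8"],
}

index = build_inverted_index(OBJECTS)
total_entries = sum(len(objs) for objs in index.values())   # equals D
```

Each describing term is touched exactly once, so the number of list entries equals the total information size D.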

Figure 5 describes the entire algorithm. For a node, the
MinDist algorithm computes the top-k object pairs that
are described by at least one term pair in its subtree. Any
such object pair must either (i) be in the top-k list of the
children, or (ii) contain terms from different subtrees of
the children. The recursive definition of the first kind allows
us to employ a divide-and-conquer approach. For the
Algorithm MinDist
Input: Node t
Output: Pair list P of size k;
        Object list B of size O(√k)
1. TB := list of objects in t of size O(√k)
2. c := number of children of t
3. for i = 1 to c
4.     CB[i], CP[i] := MinDist(t.child[i])
5.     Add t.edge[i] cost to each object in CB[i]
6. end for
7. B := Merge(CB[1], . . . , CB[c], TB)
8. TP := GenPairs(B)
9. P := Merge(CP[1], . . . , CP[c], TP)

Figure 5: The MinDist algorithm.

second kind, we need a list of objects that are close to the
subtree of the children nodes. The lists can then be joined
to generate the necessary object pairs. Thus, the MinDist
algorithm computed at the root of the ontology returns the
top-k pairs.

As shown in Figure 5, each node t maintains two lists:
(i) a list of pairs of objects P ordered by their $d_{min}$ distances;
and (ii) a list of objects B ordered by their minimum
distances $d_i$ to the node t. The length of P is at
most k. The length of B should be enough to ensure that
k distinct pairs of objects can be generated from B. The
number of objects required to do that is $k' = O(\lceil \sqrt{k} \rceil)$.⁸

When MinDist is called on a node t, it selects k' objects 
from its list L into TB. L is the list of objects associated 
with that term. MinDist is then called on each of its c 
children. The cost of the edge from t to its child is added 
to the objects in the corresponding child's object list (line 
5). This is done to ensure that the distances are maintained 
correctly. The c sorted object lists and the list of objects 
in t are then merged to produce the sorted list B. 

The merging (line 7) is done using a heap data structure
[4]. The heap is initialized with the c + 1 elements at
position 1 of each of the child lists and the list TB. The
minimum element is then extracted into B. Since all the
individual lists are sorted, the heap property guarantees
that the object extracted has the least $d_{min}$ distance
from this node. The object at the next position of the list
from which this minimum object came is then inserted into
the heap. This is repeated k' times.
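The merge step can be illustrated with a short sketch. It assumes each list holds `(distance, object)` tuples sorted by distance; the function name `merge_sorted_lists` is hypothetical, and duplicate objects are not handled here (that refinement is discussed in Section 5.1).

```python
import heapq

def merge_sorted_lists(lists, limit):
    """Return the `limit` smallest (distance, object) entries from sorted lists."""
    heap = []
    for list_id, lst in enumerate(lists):
        if lst:                                   # seed with position 0 of each list
            dist, obj = lst[0]
            heapq.heappush(heap, (dist, obj, list_id, 0))
    merged = []
    while heap and len(merged) < limit:
        dist, obj, list_id, pos = heapq.heappop(heap)   # current minimum
        merged.append((dist, obj))
        nxt = pos + 1                             # successor from the same list
        if nxt < len(lists[list_id]):
            ndist, nobj = lists[list_id][nxt]
            heapq.heappush(heap, (ndist, nobj, list_id, nxt))
    return merged

child_lists = [[(1, "a"), (4, "b")], [(2, "c"), (3, "d")], [(0, "e")]]
closest = merge_sorted_lists(child_lists, 3)
```

The heap never holds more than one entry per input list, so each extraction costs logarithmic time in the number of lists.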

The k best pairs are then generated from the k'
objects in B (method GenPairs in line 8). The resulting list
TP contains the k best distances of the object pairs which
are not in any of the subtrees.

Footnote: Since $k'(k'-1)/2 \geq k$, the actual number of
objects required is
$k' = \lceil 1/2 + \sqrt{1/4 + 2k} \rceil$.
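As a rough illustration of GenPairs, the sketch below computes the list length k' from the footnote formula and forms the k best local pairs. The names `objects_needed` and `gen_pairs` are hypothetical, and the pair distance is assumed to be the sum of the two object-to-node distances, i.e., the path through the current node.

```python
import heapq
import math
from itertools import combinations

def objects_needed(k):
    """Smallest k' such that k'(k'-1)/2 >= k."""
    return math.ceil(0.5 + math.sqrt(0.25 + 2 * k))

def gen_pairs(object_list, k):
    """object_list: (distance_to_node, object_id) entries.

    Returns the k pairs with the smallest distance through the node.
    """
    candidates = (
        (da + db, a, b) for (da, a), (db, b) in combinations(object_list, 2)
    )
    return heapq.nsmallest(k, candidates)

local_pairs = gen_pairs([(0, "a"), (1, "b"), (2, "c")], 2)
```

For example, k = 10 needs k' = 5 objects, since five objects yield exactly ten distinct pairs.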

TP is finally merged with the pair lists CP[i], i =
1, ..., c, from the children to produce the final pair list P
using a heap in the same manner as above (line 9).
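Putting the steps together, a compact recursive sketch of the procedure might look as follows. It is written under several assumptions: the `Node` class and its fields are hypothetical, dictionaries and `heapq.nsmallest` replace the explicit heap merges (so the sketch follows the logic of Figure 5 but not its stated running time), and objects that share a term are taken to have distance 0.

```python
import heapq
import math
from itertools import combinations

class Node:
    def __init__(self, objects=(), children=()):
        self.objects = list(objects)     # ids of objects annotated with this term
        self.children = list(children)   # (child_node, edge_cost) pairs

def min_dist(node, k):
    """Return (pairs, objs): the top-k pairs and the k' closest objects of a subtree."""
    k_prime = math.ceil(0.5 + math.sqrt(0.25 + 2 * k))
    obj_dist = {o: 0.0 for o in node.objects}          # objects at the node itself
    pair_dist = {}
    for child, edge in node.children:
        child_pairs, child_objs = min_dist(child, k)
        for d, o in child_objs:                         # line 5: add the edge cost
            obj_dist[o] = min(obj_dist.get(o, math.inf), d + edge)
        for d, a, b in child_pairs:                     # keep the best copy of a pair
            pair_dist[(a, b)] = min(pair_dist.get((a, b), math.inf), d)
    objs = heapq.nsmallest(k_prime, ((d, o) for o, d in obj_dist.items()))  # line 7
    for (da, a), (db, b) in combinations(objs, 2):      # line 8: pairs through this node
        key = (a, b) if a < b else (b, a)
        pair_dist[key] = min(pair_dist.get(key, math.inf), da + db)
    pairs = heapq.nsmallest(k, ((d, a, b) for (a, b), d in pair_dist.items()))  # line 9
    return pairs, objs

t1 = Node(["A"])
t2 = Node(["B", "C"])
root = Node([], [(t1, 1.0), (t2, 2.0)])
top_pairs, _ = min_dist(root, 2)
```

In this toy tree, B and C share a term and are therefore at distance 0, while A reaches B through the root at cost 1 + 2 = 3.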

5.1 Analysis of Time Complexity 

In this section, we analyze the time and space complexi- 
ties of the MinDist algorithm. 

We first analyze the time required to compute the
inverted index. The object descriptions are read once, and
for each term in an object, the corresponding list is
accessed in $O(1)$ time using hashing, and the object is
inserted at the top of the list in another $O(1)$ time. The
total time required for this phase is, therefore, $O(D)$.

We next analyze the running time for the main phase of
the algorithm. Selecting k' objects in TB requires $O(k')$
time. Adding the child edge costs to each object in the CB
lists takes $O(k'c)$ time.

At every step of the merging operation, the object with
the minimum distance is extracted from the heap and
another object is inserted. The size of the heap is,
therefore, never more than $O(c)$. Extracting the minimum
element and inserting another object into the heap takes
$O(\log c)$ time. Since the operation is repeated k' times,
the total running time of the merging procedure is
$O(k' \log c)$.

If, however, the objects in the child lists are not unique,
k' operations may not be enough to select k' different
objects. Thus, a hashtable is used to ensure that an object
is inserted into the heap only once. First, all the lists are
scanned in $O(k'c)$ time. If an object appears for the first
time, it is inserted into the hashtable with the object
identifier as the key and the distance as the value. If an
object appears more than once, the copy with the minimum
distance is maintained in the hashtable. Before any object
is inserted into the heap, the hashtable is checked. If this
object is different from the one maintained in the
hashtable, then there exists another copy of this object
with a smaller distance. Hence, this object does not need
to be considered. This limits the number of heap operations
to $O(k')$. Assuming that the hashtable operations take
constant time, the running time of the heap operations then
is $O(k' \log c)$.
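The duplicate handling can be sketched as below. This is one possible reading of the description, not the authors' code: the name `merge_unique` is hypothetical, and the hashtable stores the position of the best copy of each object (an assumption made here so that ties between equal distances are also resolved to a single copy).

```python
import heapq

def merge_unique(lists, limit):
    """Merge sorted (distance, object) lists, keeping only the closest copy of each object."""
    best = {}                                   # object -> (distance, list_id, position)
    for list_id, lst in enumerate(lists):       # initial scan of all lists
        for pos, (dist, obj) in enumerate(lst):
            if obj not in best or dist < best[obj][0]:
                best[obj] = (dist, list_id, pos)

    def push_next(heap, list_id, pos):
        """Push the next entry of a list that is the recorded best copy of its object."""
        lst = lists[list_id]
        while pos < len(lst):
            dist, obj = lst[pos]
            if best[obj] == (dist, list_id, pos):
                heapq.heappush(heap, (dist, obj, list_id, pos))
                return
            pos += 1                            # a closer copy exists elsewhere; skip

    heap = []
    for list_id in range(len(lists)):
        push_next(heap, list_id, 0)
    merged = []
    while heap and len(merged) < limit:
        dist, obj, list_id, pos = heapq.heappop(heap)
        merged.append((dist, obj))
        push_next(heap, list_id, pos + 1)
    return merged

unique = merge_unique([[(1, "a"), (5, "b")], [(2, "b"), (3, "a")]], 2)
```

Each object enters the heap at most once, so the number of heap operations is bounded by the number of distinct objects returned.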

Sorting k local object pairs requires $O(k \log k)$ time.

Finally, the sorted pair lists at the node and the children
are merged in $O(kc + k \log c)$ time using a heap and a
hashtable in a similar manner as before.

Thus, the total running time of the MinDist algorithm
at a node with c children is $O(kc + k \log k)$.

The algorithm is run once at each node of the ontology.
Assuming that there are T terms in the ontology, the total
number of children for all the nodes is $O(T)$. Hence, the
amortized cost is $O(Tk + Tk \log k) = O(Tk \log k)$.

The total running time of the MinDist algorithm is,
therefore, $O(D + Tk \log k)$.

Space Complexity: Each node in the ontology contains
an object list of size $O(k')$ and a pair list of size $O(k)$.
Once these lists are sent to the parent, they are no longer
required. Thus, at any time, the space requirement at a
node is $O(c(k' + k))$. The total space complexity,
therefore, is $O(c_{max}(k + k'))$, where $c_{max}$ is the
largest branching factor of a node in the tree. The inverted
index requires $O(D)$ space for storage.



6 The Algorithm for AvgDist 

Unlike the MinDist algorithm that needs to maintain only
one term pair for each object pair, the $d_{avg}$ distance
needs to remember all the possible term-pair distances.
Consequently, the algorithm runs in two phases: (i) the
Build phase, when pertinent information about objects is
collected at the root in a bottom-up manner, and (ii) the
Query phase, when such information is used to identify the
top-k pairs in a top-down order.

For any pair of objects, there are two types of costs that
need to be accumulated. The first is the across-tree costs,
i.e., the distances between the describing terms that
occur in different subtrees of the root, and the second is
the within-tree costs, i.e., the distances between the
describing terms that are within the same child of the root.
For example, in Figure 1, the total pairwise term distances
for $(O_1, O_4)$ can be broken into two parts: (i) the
across-tree distances between the terms of $O_1$ and the
terms of $O_4$ that lie in the different subtrees under $t_1$
and $t_2$ respectively, and (ii) the within-tree distances
between the terms of $O_1$ and the terms of $O_4$ that lie in
the same subtree under $t_2$.

To estimate the across-tree distances for object pairs at
a node, the following information needs to be calculated
for each object: (i) the number of terms in the subtree
that describe the object, and (ii) the total distance of all
such terms to this node. This information is accumulated
at the root of the ontology by the build phase
AvgDist-Build, which we describe in Section 6.4. After this
phase, the root has collected the following tuple for each
object: $(O_i, n_i, w_i)$.

Before describing the two phases of the algorithm, we
explain how lower bounds for the across-tree costs of an
object pair can be computed using the above information
and how such lower bounds can be generated in an
ordered manner.



6.1 Lower Bounds for Across-Tree Costs 

The estimates of the across-tree distances of a pair of
objects $O_i$ and $O_j$ at a node t depend on the occurrences
of their describing terms. The span of an object is
defined to be the number of subtrees of the root where its
constituent terms occur. It can be either single, i.e., its
terms occur in only one subtree, or multiple, i.e., its terms
occur in multiple subtrees. Based on these, three different
cases need to be considered. In each case, we would like
to write the bounds at a node in terms of the parameters
maintained for $O_i$ and $O_j$ at the node, i.e., in terms of
$(O_i, n_i, w_i)$ and $(O_j, n_j, w_j)$.

Case 1: Both the objects have single spans. Two
sub-cases need to be considered.

Sub-case 1(a): The objects are in the same subtree. The
across-tree cost is 0, and nothing can be concluded about
their distance in the subtree without descending deeper
into the subtree. Hence, the lower bound is

$$d_{lb} = 0 \qquad (4)$$

Sub-case 1(b): The objects are in different subtrees. In
this case, the across-tree distance can be estimated
exactly. The distance between a term $t_i \in O_i$ and a term
$t_j \in O_j$ is $d(t_i, t) + d(t_j, t)$, where t is the node at
which this lower bound is being computed. The total
across-tree distance is obtained by adding all such
combinations of terms:

$$
\begin{aligned}
\sum_{i=1}^{|O_i|} \sum_{j=1}^{|O_j|} \bigl[ d(t_i, t) + d(t_j, t) \bigr]
&= \sum_{i=1}^{|O_i|} \sum_{j=1}^{|O_j|} d(t_i, t)
 + \sum_{i=1}^{|O_i|} \sum_{j=1}^{|O_j|} d(t_j, t) \\
&= n_j \sum_{i=1}^{|O_i|} d(t_i, t) + n_i \sum_{j=1}^{|O_j|} d(t_j, t) \\
&= n_j w_i + n_i w_j \qquad (5)
\end{aligned}
$$

Thus, the average distance is

$$d_{lb} = d = \frac{w_i}{n_i} + \frac{w_j}{n_j} \qquad (6)$$

Since the within-tree distance for this pair is 0, this is the
exact distance.
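The identity behind Eqs. (5) and (6) can be checked numerically. The sketch below is only an illustration; the function names and the sample distances are assumptions made for the example.

```python
def across_tree_average(dists_i, dists_j):
    """dists_i[a] is d(t_a, t) for the terms of O_i; dists_j likewise for O_j."""
    n_i, n_j = len(dists_i), len(dists_j)
    w_i, w_j = sum(dists_i), sum(dists_j)
    total = n_j * w_i + n_i * w_j          # Eq. (5)
    return total / (n_i * n_j)             # Eq. (6): w_i/n_i + w_j/n_j

def brute_force_average(dists_i, dists_j):
    """Average of d(t_i, t) + d(t_j, t) over every term pair."""
    sums = [di + dj for di in dists_i for dj in dists_j]
    return sum(sums) / len(sums)

separated = across_tree_average([1, 3], [2, 2, 5])
enumerated = brute_force_average([1, 3], [2, 2, 5])
```

With $n_i = 2$, $w_i = 4$, $n_j = 3$ and $w_j = 9$, both routes give $30/6 = 5$, so the pair-wise sum never has to be enumerated.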

Case 2: Both the objects have multiple spans. The
minimum across-tree distance can be estimated in a manner
similar to that in Case 1(b). There are at least two
pairings of terms of $O_i$ and $O_j$ that are in different
subtrees. Using Eq. (5), the total across-tree costs for these
pairings are $w_{i_1} n_{j_2} + w_{j_2} n_{i_1}$ and
$w_{i_2} n_{j_1} + w_{j_1} n_{i_2}$, where $n_{i_1}$, $w_{i_1}$,
etc. are the number of terms of $O_i$ in one subtree and
their total distance to t from that subtree. The values of
$n_{i_1}$, $n_{i_2}$, $n_{j_1}$, and $n_{j_2}$ are at least 1. Thus,
the total across-tree distance is at least
$w_{i_1} + w_{i_2} + w_{j_1} + w_{j_2} = w_i + w_j$. The
lower bound for the average pairwise distance, then, is

$$d_{lb} = \frac{w_i + w_j}{n_i n_j} \qquad (7)$$

Case 3: One object $O_i$ has a single span, and the other
object $O_j$ has a multiple span. Similar to Case 2, there is
at least one subtree containing terms of $O_j$ but not
containing terms of $O_i$. The total across-tree cost is then
the minimum of $w_i n_{j_1} + w_{j_1} n_i$ and
$w_i n_{j_2} + w_{j_2} n_i$. Since at least one term of $O_j$
is not in the same subtree as $O_i$, $n_{j_1}$ and $n_{j_2}$
are at least 1. However, without knowing where the terms of
$O_j$ occur, nothing can be concluded about $w_{j_1}$ and
$w_{j_2}$. Since the terms may occur at the node itself, the
estimates for $w_{j_1}$ and $w_{j_2}$ are 0. Hence, the total
distance is at least $w_i$, producing a lower bound of

$$d_{lb} = \frac{w_i}{n_i n_j} \qquad (8)$$
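The three cases can be collected into one small function. This is a sketch under stated assumptions: each object is represented as a hypothetical tuple `(n, w, subtree)` in which `subtree` is `None` for a multiple span, and the bounds for Cases 2 and 3 are taken to be $(w_i + w_j)/(n_i n_j)$ and $w/(n_i n_j)$ with $w$ the total distance of the single-span object, as derived in the text above.

```python
def lower_bound(obj_i, obj_j):
    """Lower bound on the average pairwise distance of two objects at a node.

    Each object is (n, w, subtree); subtree is None when the span is multiple.
    """
    n_i, w_i, s_i = obj_i
    n_j, w_j, s_j = obj_j
    single_i, single_j = s_i is not None, s_j is not None
    if single_i and single_j:
        if s_i == s_j:
            return 0.0                       # Case 1(a): same subtree
        return w_i / n_i + w_j / n_j         # Case 1(b): exact distance
    if not single_i and not single_j:
        return (w_i + w_j) / (n_i * n_j)     # Case 2: both spans multiple
    w_single = w_i if single_i else w_j      # Case 3: only the single-span w is known
    return w_single / (n_i * n_j)

same = lower_bound((2, 4.0, "a"), (3, 9.0, "a"))
different = lower_bound((2, 4.0, "a"), (3, 9.0, "b"))
```

Only Case 1(b) yields an exact value; the other cases return estimates that a later descent into the subtrees can tighten.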

6.2 Generating Ordered Pairs 

Though the above mentioned lower bounds can be
computed for a given pair, the cost of computing them for
every pair is $O(N^2)$. We would like to avoid such costly
online operations. The trick is to separate the parameters
of $O_i$ and $O_j$ in each lower bound such that the bounds
can be systematically generated in an ordered manner
whenever needed. The order of generation will guarantee
that at any point of time, the lower bounds of the pairs not
yet examined will be greater than or equal to the lower
bounds of the pairs already generated. In this section, we
discuss ways to achieve this for each of the cases
mentioned above.

To identify pairs of objects in the same subtree (Case 
1(a)), c + 1 different lists are maintained at the root corre- 
sponding to itself and its c children. 

To handle Case 1(b), each of these c + 1 lists of objects
is sorted by the average distance $w_i/n_i$. Given two such
sorted child lists, it is guaranteed that the lower bound
(which is the sum of the distances) for an object pair at
positions $p_i$ in the first list and $p_j$ in the second list
is no greater than the estimate of every pair whose
positions are $\geq p_i$ and $\geq p_j$. Thus, every time a pair
at positions $(p_i, p_j)$ is inspected, only its immediate
successors $(p_i + 1, p_j)$ and $(p_i, p_j + 1)$ need to be
considered. Since there are c + 1 child lists, the number of
possible ways of pairing is $c(c + 1)/2$.
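The successor rule lends itself to a lazy generator. The sketch below assumes two lists of per-object averages $w/n$, each sorted in ascending order; the name `ordered_pairs` and the `seen` set that prevents a position from being pushed twice are assumptions of this illustration.

```python
import heapq

def ordered_pairs(list_a, list_b):
    """Yield (bound, i, j) in non-decreasing order of list_a[i] + list_b[j]."""
    if not list_a or not list_b:
        return
    heap = [(list_a[0] + list_b[0], 0, 0)]     # the pair of first positions is the best
    seen = {(0, 0)}
    while heap:
        bound, i, j = heapq.heappop(heap)
        yield bound, i, j
        for ni, nj in ((i + 1, j), (i, j + 1)):  # only the immediate successors
            if ni < len(list_a) and nj < len(list_b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (list_a[ni] + list_b[nj], ni, nj))

generated = list(ordered_pairs([1.0, 4.0], [2.0, 3.0]))
```

Because the generator is lazy, a caller that stops after a few pairs never pays for the remaining combinations.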



The lower bound for Case 2 is not easily separable in
terms of parameters of $O_i$ and $O_j$. It is, however,
separable if for an object pair, the numbers of terms for
the objects (i.e., $n_i$, $n_j$) are known a priori. To do that,
the list of objects with multiple spans is partitioned such
that each partition contains objects with a particular $n_i$.
Pairing $O_i$ and $O_j$ and knowing which partitions they
come from immediately defines the denominator of the lower
bound. Thus, if there are r partitions, sorting each
partition by $w_i$ and performing pairings in the same way
as done for Case 1(b) orders the pairs according to their
lower bounds.

Case 3 is handled similarly. The single-span list is
broken into c + 1 lists and the multiple-span list into r
partitions. Generating all $r(c + 1)$ pairings gives the
lower bounds in an ordered manner.

We next describe how the Query phase of the AvgDist 
algorithm uses these lower bounds. 

6.3 Query Phase 

The AvgDist-Query procedure (Figure 6) is run at the root
of the ontology. It outputs a list P of top-k object pairs.
When the size of P is less than k, P.dist is $\infty$;
otherwise, it is maintained as the $k$-th distance in P,
i.e., the largest distance among the pairs currently kept.

The list L of objects is broken into c + 1 + r lists
corresponding to single and multiple spans as explained in
the earlier section. From these lists, the initial pairs with
the lower bounds are generated (method GenInitialPairs in
line 6) and put into a heap H. See Section 6.2 for details
on how to generate these pairs.

The top-down searching for object pairs proceeds in a
manner where at every stage, only the current "best" pair
is examined [10]. Thus, this search strategy is called the
best-first search.

The algorithm progresses by extracting the current best
pair from the heap, i.e., the pair p with the current best
lower bound (line 8). If the lower bound is an estimate
for p and not an exact distance as in Case 1(b), the bound
can be improved in two ways (line 11). First, the
within-tree costs at the subtrees in the next level can be
estimated again using Eqs. (4)-(8) by descending into the
subtree (denoted as AvgDist-NextEstimate). The descent is
made in a breadth-first order on the tree (see the footnote).
The second way is to compute the term-wise distances fully
without resorting to recursion (denoted as
AvgDist-Complete). This, however, disregards the structure
of the ontology.

If the exact distance of p is computed, the list P is
examined. If the $k$-th distance in P is more than that of p (line

Footnote: Any order, e.g., depth-first order, will also work.
However, if the edge distances decrease exponentially,
breadth-first ordering produces better bounds.

Algorithm AvgDist-Query
Input: Node root
Output: Pair list P of size k

1. L := list of object mappings in root
2. P := $\emptyset$ (therefore, P.dist := $\infty$)
3. c := number of children of root
4. r := number of partitions of objects
5. Divide L into c + 1 + r lists B
6. A := GenInitialPairs(B)
7. Insert each a $\in$ A into heap H
8. p := Pop(H)
9. while p.dist < P.dist
10.     if Done(p) = false
11.         p := UpdateEstimate(p)
12.     end if
13.     if Done(p) = true
14.         if p.dist < P.dist
15.             Insert p into P
16.             P.dist := $k$-th distance in P
17.         end if
18.     else
19.         Insert p into H
20.     end if
21.     A := GenNextPairs(p, L)
22.     Insert each a $\in$ A into H
23.     p := Pop(H)
24. end while

Figure 6: The Query phase of the AvgDist algorithm.



14), p is inserted into P and P.dist is modified. The size 
of P is maintained to be at most k by removing the pair 
with the largest distance. 

If, however, the lower bound of p is still an estimate,
p is re-inserted into the heap H (line 19). The next
pairs are generated from the c + 1 + r lists (method
GenNextPairs in line 21, as described in Section 6.2) and
inserted into the heap (line 22).

In the next iteration, the pair which is now the best is
examined (line 23). If this pair has a distance more than
the $k$-th distance in P (i.e., P.dist), it is guaranteed that
all the pairs currently in the heap and all the pairs that
are not yet generated will have a greater distance. This is
due to the properties of the heap and the ordered nature of
generating the pairs from the c + 1 + r lists. Thus, the
algorithm then terminates correctly.
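The termination argument can be illustrated with a simplified sketch that corresponds to the Complete variant, where every examined pair is resolved exactly in one step, so no pair is re-inserted into the heap. The function name `best_first_top_k`, the candidate stream and the `exact_distance` callback are assumptions of this example and do not reproduce the full procedure of Figure 6.

```python
import heapq

def best_first_top_k(ordered_candidates, exact_distance, k):
    """ordered_candidates yields (lower_bound, pair) in non-decreasing bound order."""
    top = []                                   # max-heap through negated distances
    for bound, pair in ordered_candidates:
        if len(top) == k and bound >= -top[0][0]:
            break                              # no remaining pair can beat the k-th distance
        dist = exact_distance(pair)            # resolve the current best pair
        if len(top) < k:
            heapq.heappush(top, (-dist, pair))
        elif dist < -top[0][0]:
            heapq.heapreplace(top, (-dist, pair))
    return sorted((-d, p) for d, p in top)

exact = {("a", "b"): 5.0, ("a", "c"): 2.0, ("b", "c"): 9.0, ("c", "d"): 10.0}
examined = []

def exact_distance(pair):
    examined.append(pair)                      # record which pairs are fully computed
    return exact[pair]

candidates = [(1.0, ("a", "b")), (2.0, ("a", "c")), (6.0, ("b", "c")), (7.0, ("c", "d"))]
result = best_first_top_k(iter(candidates), exact_distance, 2)
```

In this toy run only two of the four pairs are computed exactly: the third candidate already has a lower bound of 6, which exceeds the second-best distance of 5 found so far.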



Name           Number of GO Terms (T)   Number of Genes (N)
Process        13762                    3437
Function       7803                     1958
Localization   1990                     645

Table 5: The Gene Ontology (GO) datasets.



6.4 Build Phase 



In this section, we describe how AvgDist-Build computes
the information $(O_i, n_i, w_i)$ for an object (see the
footnote). Each node t maintains an inverted list L of
objects $O_i$ described using t. First, it converts L into B
by setting $n_i = 1$ and $w_i = 0$ for each $O_i \in L$. Then, it
calls AvgDist-Build for each of its children. For each list
CB that it receives from a child, and for each object
$O_j \in CB$, it modifies $w_j$ by adding to it the distance to
the child node multiplied by the number of times $O_j$ occurs
in the child subtree, i.e., $w_j = w_j + dist \times n_j$, where
dist is the edge distance from t to its child. This ensures
that the total distance from t is maintained correctly,
since each of the $n_j$ occurrences has to traverse the
distance dist.
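A recursive sketch of the Build phase is given below. The `Node` class and the dictionary representation are assumptions made for illustration. The sketch also adds an object received from a child to the list when the node itself does not describe it, which the text implies but does not spell out.

```python
class Node:
    def __init__(self, objects=(), children=()):
        self.objects = list(objects)     # objects described using this term
        self.children = list(children)   # (child_node, edge_distance) pairs

def avg_dist_build(node):
    """Return {object_id: (n, w)} for the subtree rooted at node."""
    info = {o: (1, 0.0) for o in node.objects}           # n_i = 1, w_i = 0
    for child, dist in node.children:
        for obj, (n_c, w_c) in avg_dist_build(child).items():
            n, w = info.get(obj, (0, 0.0))
            # every one of the n_c occurrences crosses the edge of length dist
            info[obj] = (n + n_c, w + w_c + dist * n_c)
    return info

leaf1 = Node(["A"])
leaf2 = Node(["A", "B"])
inner = Node(["B"], [(leaf1, 1.0)])
root = Node([], [(inner, 2.0), (leaf2, 3.0)])
summary = avg_dist_build(root)
```

Here object A is described by two terms at distances 3 and 3 from the root, giving (n, w) = (2, 6), and object B by two terms at distances 2 and 3, giving (2, 5).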



Analysis of Space and Time Complexities: Assume the
total size of the object description to be D, which is at most
$N \times T$, where N is the total number of objects and T the
total number of terms. The inverted index requires $O(D)$
time and space to construct. We next analyze the space
and time complexity of AvgDist-Build in terms of these
parameters.

Each object's information is stored at the terms
describing it. The information stored in a term is repeated
along all its ancestors. Since the size of the description is
D, and there are $O(\log T)$ ancestors (assuming the
ontology to be balanced), the storage cost is $O(D \log T)$.

The running time can be analyzed similarly. At the leaf
level of the tree, there are D describing terms. When this
$O(D)$ information is sent up to the next level, the time
required to combine the information is still $O(D)$ since
each object description is read only once and is matched
using a hashtable to the information already computed.
Assuming the height of the tree to be $O(\log T)$, the total
running time is $O(D \log T)$.

7 Experiments

7.1 Datasets 

We have experimented with real as well as synthetic
datasets. The real dataset is that of Gene Ontology (GO,
http://www.geneontology.org/). There are three ontologies
in GO, corresponding to biological process, molecular
function and cellular component (localization) of
terms. The details of the three ontologies are given in
Table 5. The datasets were curated by hashing gene
descriptions using their bit-vector representations of the
terms and removing the identical genes.

The synthetic datasets were generated by controlling
the number of objects, the number of terms, the average
branching factor of the ontology tree and the average
number of terms per object. The ontologies and the object
datasets are created separately. Ontologies have a fixed
size and an average branching factor. Starting from the
root, we generate a random number of children by
perturbing the average branching factor within some limits.
We continue with this at all successive nodes. The object
dataset is generated with a fixed number of objects and
an average number of terms per object. Again, a random
number is generated from the average by perturbing it.
Then, terms are picked from the ontology randomly without
replacement for the required number of terms. This
process is repeated for all objects.

7.2 Experimental Setup 

When the distance function between the objects is defined
as the earth mover's distance, the following schemes were
evaluated:

• $L_1$-reduced: In this scheme (Section 4), the $L_1$
distance on the reduced number of terms is used.

• $L_1$-full: In this scheme, the $L_1$ distance on all terms
is used. The tree is not pruned at height 1.

• EMD-reduced: All the $O(N^2)$ EMDs on the reduced
number of terms are computed. These are then used
to prune those object pairs for which the reduced
EMD is greater than the $k$-th best EMD already found.

• Brute-force: In this scheme, all the $O(N^2)$ pairs are
computed and then the top-k pairs are returned.

The running time of the brute-force scheme (267 s for
N = 100 objects) is too impractical to be of any use and
is, therefore, not reported. Also, the times of $L_1$-full are
not reported since, in the best case, it can only save
$L_1$-reduced computations, which are very fast anyway. In
all the experiments, it was actually worse than
$L_1$-reduced.

Footnote: Figure 17 in the Appendix outlines the
AvgDist-Build algorithm.

When the distance function between the objects is
defined as the minimum pairwise distance between the
terms, the following schemes were considered:

• MinDist: This is the scheme described in Section 5
that has a running time of $O(Tk \log k)$.

• Brute-force: In this scheme, all the $O(N^2)$ pairs
are computed and then the top-k pairs are returned.
Maintaining a heap of size at most k gives the running
time of this scheme to be $O(N^2 \log k)$. Due to
its exorbitant online costs, this scheme is not
practically useful.
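For reference, the brute-force baseline with a bounded heap can be sketched as follows. The function name and the input layout are assumptions, and the toy term distance used in the example (absolute difference of integer term ids) merely stands in for the tree distance of the ontology.

```python
import heapq
from itertools import combinations

def brute_force_top_k(objects, term_distance, k):
    """objects: dict id -> list of terms; returns the k closest pairs by d_min."""
    heap = []                                   # max-heap (negated distances), size <= k
    for (a, terms_a), (b, terms_b) in combinations(objects.items(), 2):
        d = min(term_distance(x, y) for x in terms_a for y in terms_b)
        if len(heap) < k:
            heapq.heappush(heap, (-d, a, b))
        elif d < -heap[0][0]:                   # better than the current k-th pair
            heapq.heapreplace(heap, (-d, a, b))
    return sorted((-nd, a, b) for nd, a, b in heap)

# Placeholder metric for the example only.
toy_distance = lambda x, y: abs(x - y)
toy_objects = {"A": [1, 10], "B": [2], "C": [10, 20]}
baseline = brute_force_top_k(toy_objects, toy_distance, 2)
```

Every one of the N(N-1)/2 pairs is evaluated, which is what makes this scheme impractical for large N.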

For N = 10*, the top-k computation using the brute-force
algorithm finishes in about 300 s. Since MinDist has a
better running time, we report the experiments for MinDist
only.

When the distance function between the objects is
defined as the average pairwise distance between the terms,
the following schemes were evaluated:

• AvgDist-NextEstimate: In this variant of AvgDist,
the estimate for the best pair is improved by
progressively descending into the subtrees and estimating
the across-tree costs at the roots of those subtrees.

• AvgDist-Complete: This is the other variant of
AvgDist where the exact distance is computed in one
go by computing all the pairwise term distances.

• Brute-force: In this scheme, all the $O(N^2)$ pairs are
computed and then the top-k pairs are returned.

The running time of the brute-force scheme (300 s for 10''
dataset) is much higher than that for the AvgDist schemes.
Consequently, it is not discussed any further.

Sections 7.3 to 7.7 report experiments on EMD, while
Sections 7.8 to 7.10 and 7.11 to 7.14 report on MinDist
and AvgDist, respectively.

7.3 Effect of k on EMD 

Figure 7 shows the effect of k on the running time for the
GO localization dataset. When k is increased, more $L_1$
computations are needed before the TA can halt.
Consequently, more EMD calculations are also required.

Figure 7: Effect of k on EMD.

EMD: Synthetic dataset, T=2.5x10^, l<=5

Figure 8: Effect of N on EMD.

7.4 Effect of N on EMD 

Figure 8 shows that the scalability of our algorithm with
N is better than quadratic. Even though the number of
object pairs increases quadratically, due to $L_1$ lower
bounding, many of the object pairs are pruned.
Consequently, the number of full EMD computations
increases by a lower factor. Also, even for N = 350, which
translates to about $6 \times 10^4$ object pairs, our
algorithm finishes in only 55 s.

7.5 Number of Object Pairs for EMD 

To check the effect of increasing N, we measured the
ratio of object pairs for which full EMD computation was
done. The ratio was measured as the number of pairs
investigated to the total number of possible pairs
($N(N-1)/2$) and is denoted by $\eta$. As Figure 9 shows,
$\eta$ decreases when N is increased. For N = 250, the ratio
of EMD computations becomes lower than 10%.

7.6 Effect of T on EMD

The next experiment measures the effect of the total
number of terms on the EMD computations. Since both
the $L_1$ and EMD-reduced schemes depend only on the
reduced number of terms, the effect of T is minimal (graph
not shown).

Figure 9: Effect of N on number of pairs examined for
EMD.

EMD: Synthetic dataset, N=50, T=2.5x10^, k=5

Figure 10: Effect of t on EMD.

7.7 Effect of t on EMD

As the number of children of the root, i.e., t, increases,
the complexity of the TA increases linearly. Figure 10
shows the running times for varying t. The size of each
object description is limited to 10. When t < 10, the time
increases. The EMD-reduced behaves in the opposite
manner. This is due to the interaction of two opposing
effects: as t increases, each computation takes more time,
but the lower bound gets tighter as more terms are
taken into account, resulting in fewer full EMD
computations. However, when t > 10, since there are at
most 10 terms in each object, the object description size
does not get reduced and each EMD-reduced computation
takes as much time as the full EMD computation. Since
$O(N^2)$ of these computations are performed, the running
time shoots up. The $L_1$-reduced, on the other hand, shows
only a little increase.

The next set of experiments measures the effect of
different parameters on the MinDist algorithm.

MinDist: Gene Ontology (GO) datasets

Figure 11: Effect of k on MinDist.

Figure 12: Effect of T on MinDist.
7.8 Effect of k on MinDist 

The first set of experiments measures the effect of the
number of top pairs queried (k) on the running time of the
MinDist algorithm. As shown in Figure 11, the scalability
of MinDist with k is linear. The analysis done in
Section 5.1 shows that for small values of k, this is the
expected behavior. The largest real dataset, GO process,
finishes in less than 1 s for k < 50, demonstrating the
effectiveness of the algorithm.



7.9 Effect of T on MinDist 

We next report the effect of the number of terms T on
the running time. Figure 12 shows that increasing T
increases the running time of MinDist linearly,
independent of the value of k. We also note the practicality
of the MinDist algorithm. For a very large dataset of size
N = 10^ and a very large tree of size T ~ 10^, a top-100
query finishes in about 100 s. For smaller values of k and
T, the running time is in seconds.

Figure 13: Effect of k on AvgDist.

7.10 Effect of N on MinDist 

The running time analysis of the MinDist algorithm shows
that it is independent of the number of objects N. When
the number of terms T is kept constant, the experiments
confirm that the running time is practically constant even
when N is increased from 10^ to 10^ (graph not shown).

The next set of experiments evaluates the performance
of the two variants of AvgDist.

7.11 Effect of k on AvgDist 

The first experiment on AvgDist illustrates the effect of k
on the running time of the Build phase and the two
different variants, NextEstimate and Complete, for the two
larger GO datasets. All the six curves in Figure 13 are
relatively flat, showing that the effect of k is minimal.
Intuitively, the running time of AvgDist depends on the
actual number of object pairs investigated. For the GO
datasets, even for values of k as large as 100, this remains
almost constant. Moreover, the Build phase takes negligible
time in comparison to the Query phase.

7.12 Number of Object Pairs for AvgDist 

We further investigated the effect of k by measuring the
number of object pairs that are examined in the Query
phase of the AvgDist algorithm. For this, we increased
k up to 10000. Figure 14 shows that $\eta$ (i.e., the ratio to
the total number of possible pairs) increases very slowly
with k. The results are robust across different values of T
(as shown in the figure) and N (not shown). This is the
reason why the running time is also constant across k.

The NextEstimate method examines less than 2% of
the total number of pairs. The Complete method
investigates more object pairs (about 7%) than the
NextEstimate method. Computing a distance for the current
best pair guarantees that only those pairs which have a
bound lower than this distance will be analyzed. For the
NextEstimate method, the distance of the best pair is
computed progressively, thereby saving on full AvgDist
computations as compared to the Complete method, which
finds the actual distance of the best pair.

Figure 14: Effect of k on number of pairs examined for
AvgDist.

AvgDist: Synthetic dataset, k=1

Figure 15: Effect of N on number of pairs examined for
AvgDist.



7.13 Effect of N on AvgDist 

We next discuss the experimental results when the number
of objects is varied. We first measure the effect of the
number of objects on the Build phase. From the analysis
done in Section 6.4, we expect the running time to grow
linearly with the size of the input information. Assuming
that the number of describing terms for an object is
constant, the size of the information is directly
proportional to the number of objects. The experiment
shows that the scalability is indeed linear (graph not
shown).

The next experiment (Figure 15) shows that the number
of pairs investigated grows at most quadratically with N.
Since the objects are generated using the same random
process, this is expected.



7.14 Effect of T on AvgDist 

The next set of experiments measures the effect of the
number of terms T on the different components of the
AvgDist algorithm. Figure 16 shows the time taken to
complete the Build phase. Note that this phase takes the
same amount of time regardless of the choice of the method
for estimating the distance of a pair. Since the build
procedure is run at each node, the effect of T is linear.
Further, as can be seen from the plot, when the number of
objects increases, more information needs to be processed
at each node and the running time increases linearly.

The next experiment measures the number of pairs
investigated against different values of T. As shown in
Section 7.13, the number of pairs depends primarily on the
distribution of the objects on the tree (mainly the number
of objects falling in the single-span lists) and not on
the size of the tree. Consequently, the size of the tree T
has no appreciable effect. Similar to the previous set of
experiments, this effect of T (or rather the lack of it) is
directly reflected in the running time as well. The running
time is essentially independent of T (graph not shown).



8 Conclusions 

In this paper, we proposed the problem of finding the top-k
most similar object pairs annotated with terms from an
ontology. The terms represent concepts and the objects
are described using these concepts. The join problem
exposed the computational aspects of the domain well.

We then defined and motivated three object distances
that can be used to define the dissimilarity (or,
equivalently, similarity) between a pair of objects. The
minimum pairwise distance is useful in order to search for
objects that share a similar term. The average pairwise
distance captures the notion of similarity when the object
definitions are imprecise or when objects need to be
compared on multiple attributes. The third one, earth
mover's distance, is particularly useful as it finds the best
way of matching terms in one object with those in the other
by capturing the term-to-term relationships, and measures
the distance corresponding to this best matching.

Finally, we designed algorithms to efficiently solve the
problem using all the above distance measures. The
algorithm for EMD uses the $L_1$ distance as a lower bound
and even avoids computing all the $L_1$ distances by
modifying the threshold algorithm. The algorithm that
solves the problem for the minimum pairwise distance runs
in $O(D + Tk \log k)$ time. For the average pairwise
distance, we devised a best-first search strategy that
avoids investigating all pairs by generating lower bounds
in an ordered manner. Experimental evaluations demonstrated
the practicality and scalability of our algorithms.

Figure 16: Effect of T on Build phase of AvgDist.

In future, we would like to design algorithms for other
distance measures and lower bounds. We would also like
to develop methods that use term statistics to improve
the expected running time and further explore the optimal
height of pruning the ontology tree for EMD. Lastly,
algorithms for k-NN and range queries should be simple
extensions of the proposed algorithms.

Appendix



A Average pairwise distance follows the triangle inequality

Lemma 1. The average pairwise distance $d_{avg}$ as defined
in Eq. (2) follows the triangle inequality property.

Proof. Assume any three objects A, B and C. We need
to prove that $d_{avg}(A, B) + d_{avg}(B, C) \geq d_{avg}(C, A)$.

Consider any terms $a_i \in A$, $b_j \in B$, and $c_k \in C$.
Since the term distance function is a metric, we can write
$d(a_i, b_j) + d(b_j, c_k) \geq d(c_k, a_i)$. Adding the
$|A| \cdot |B| \cdot |C|$ inequalities together yields

$$
\sum_{i,j,k=1}^{|A|,|B|,|C|} d(a_i, b_j)
+ \sum_{i,j,k=1}^{|A|,|B|,|C|} d(b_j, c_k)
\geq \sum_{i,j,k=1}^{|A|,|B|,|C|} d(c_k, a_i)
$$

or,

$$
|C| \sum_{i,j=1}^{|A|,|B|} d(a_i, b_j)
+ |A| \sum_{j,k=1}^{|B|,|C|} d(b_j, c_k)
\geq |B| \sum_{k,i=1}^{|C|,|A|} d(c_k, a_i)
$$

Dividing by $|A| \cdot |B| \cdot |C|$, we get

$$d_{avg}(A, B) + d_{avg}(B, C) \geq d_{avg}(C, A)$$

□
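The lemma can also be checked empirically. The sketch below is only a sanity check, not a proof: it assumes that $d_{avg}$ is the sum of all pairwise term distances divided by the product of the set sizes, and it uses the absolute difference between integers as a stand-in metric for the term distance.

```python
import random
from itertools import product

def d_avg(a, b, d):
    """Average pairwise distance between two term sets under term metric d."""
    return sum(d(x, y) for x, y in product(a, b)) / (len(a) * len(b))

def triangle_holds(a, b, c, d, eps=1e-9):
    """Check d_avg(A, B) + d_avg(B, C) >= d_avg(C, A) up to rounding error."""
    return d_avg(a, b, d) + d_avg(b, c, d) + eps >= d_avg(c, a, d)

metric = lambda x, y: abs(x - y)        # placeholder metric for the example
rng = random.Random(0)
trials = [
    tuple([rng.randint(0, 50) for _ in range(rng.randint(1, 4))] for _ in range(3))
    for _ in range(200)
]
all_hold = all(triangle_holds(a, b, c, metric) for a, b, c in trials)
```

Note that $d_{avg}(A, A)$ is generally not 0 when A has several distinct terms, so the measure satisfies the triangle inequality without being a metric in the strict sense.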



B Hashing 

If $L_1$ is computed on all the terms in the TA phase of the
EMD algorithm, then the time required for sorting the N
objects in the initial phase can be saved. The key is to
observe that all values for an object will be of the form
$1/c$, where c is the count of the number of terms in the
object. Since c is at most T, a hashtable of size T with
keys $\frac{1}{1}, \dots, \frac{1}{T}$ can be maintained. The N
object values will be hashed into it. The heap H will be
filled up with values of the form $\frac{1}{c}$ only. This
requires a running time of $O(N + T)$ instead of
$O(N \log N)$.
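The idea amounts to a bucket sort over the possible term counts. The sketch below is an illustration under assumptions: the function name is hypothetical, and ascending order of $1/c$ is chosen arbitrarily since the required direction is not stated here.

```python
def order_by_inverse_count(term_counts, num_terms):
    """term_counts: dict object -> c with 1 <= c <= num_terms.

    Returns (1/c, object) entries ordered by 1/c without a comparison sort.
    """
    buckets = [[] for _ in range(num_terms + 1)]   # bucket c holds objects with value 1/c
    for obj, c in term_counts.items():             # O(N): hash each object into its bucket
        buckets[c].append(obj)
    ordered = []
    for c in range(num_terms, 0, -1):              # O(T): 1/T is the smallest value
        ordered.extend((1 / c, obj) for obj in buckets[c])
    return ordered

ordering = order_by_inverse_count({"A": 2, "B": 1, "C": 4}, 4)
```

One pass over the objects plus one pass over the T buckets gives the O(N + T) total.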

When a reduced number of terms is used, the values
will be of the form $t_1/t_2$, where $1 \leq t_1 \leq T$ and
$1 \leq t_2 \leq T$. This requires a running time of
$O(N + T^2)$.



Algorithm AvgDist-Build
Input: Node t
Output: Object list B

1. L := list of objects in t
2. B := Modify(L)
3. c := number of children of t
4. for i = 1 to c
5.     CB[i] := AvgDist-Build(t.child[i])
6.     for each co $\in$ CB[i]
7.         if $\exists$ o := Find(co.id, B)
8.             o.dist := o.dist + co.dist
                        + co.count $\times$ t.edge[i]
9.             o.count := o.count + co.count
10.        end if
11.    end for
12. end for

Figure 17: The Build phase of the AvgDist algorithm.

C Algorithm AvgDist-Build